7 research outputs found

    Theoretically-Efficient and Practical Parallel DBSCAN

    Full text link
    The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(nlogn)O(n\log n) work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups by up to 33x over the best sequential algorithms

    IASTED Int. Conf. on Databases and Applications (DBA'04) A Quality Measure for Distributed Clustering

    No full text
    Clustering has become an increasingly important task in modern application domains. Mostly, the data are originally collected at different sites. In order to extract information from these data, they are merged at a central site and then clustered. Another approach is to cluster the data locally and extract suitable representatives from these clusters. Based on these representatives a global server tries to reconstruct the complete clustering. In this paper, we discuss the complex problem of finding a suitable quality measure for evaluating the quality of such a distributed clustering. We introduce a discrete and continuous quality criterion which we empirically compare to each other

    M.: Visual Mining of Cluster Hierarchies

    No full text
    Abstract. Similarity search in database systems is becoming an increasingly important task in modern application domains such as multimedia, molecular biology, medical imaging, computer aided engineering, marketing and purchasing assistance as well as many others. In this paper, we show how visualizing the hierarchical clustering structure of a database of objects can aid the user in his time consuming task to find similar objects. We present related work and explain its shortcomings which led to the development of our new methods. Based on reachability plots, we introduce approaches which automatically extract the significant clusters in a hierarchical cluster representation along with suitable cluster representatives. These techniques can be used as a basis for visual data mining. We implemented our algorithms resulting in an industrial prototype which we used for the experimental evaluation. This evaluation is based on real world test data sets and points out that our new approaches to automatic cluster recognition and extraction of cluster representatives create meaningful and useful results in comparatively short time.
    corecore